Update htcondor instructions #459

Open
jhiemstrawisc wants to merge 12 commits into Reed-CompBio:main from jhiemstrawisc:update-htcondor-instructions

@jhiemstrawisc

Collaborator

This largely reformats the directory structure needed to run SPRAS workflows with HTCondor. In particular, it moves a lot of the helper code/submit files out of docker-wrappers/SPRAS/ into a top-level htcondor/ directory. I can do this now that the HTCondor executor has matured significantly, and can handle all the paths as they're configured in this diff.

To run a test SPRAS workflow, try following along with the instructions in docs/htcondor.rst. If anything is confusing, or you get hung up on any of the steps, let's discuss what I can do to make things more clear.

@jhiemstrawisc jhiemstrawisc requested a review from agitter January 23, 2026 21:56
@read-the-docs-community

read-the-docs-community Bot commented Jan 23, 2026

Documentation build overview

📚 spras | 🛠️ Build #33204015 | 📁 Comparing eae9f2e against latest (d990664)

  🔍 Preview build  

3 files changed
± htcondor.html
± install.html
± fordevs/spras.html

@tristan-f-r tristan-f-r left a comment

Collaborator

[I'll have to restore my HTCondor access to follow this.]

Comment thread htcondor/spras_profile/config.yaml
Comment thread docs/htcondor.rst Outdated
Comment thread run_htcondor.sh Outdated

@tristan-f-r tristan-f-r left a comment

Collaborator

Some code nitpicks

Comment thread htcondor/snakemake_long.py Outdated
Comment thread htcondor/snakemake_long.py Outdated
@tristan-f-r tristan-f-r added the documentation Improvements or additions to documentation label Jan 24, 2026

@agitter agitter left a comment

Collaborator

I'm testing the Snakemake long execution mode. The first time my jobs went on hold because I put my spras-v0.6.0.sif file in the htcondor/ directory instead of the root directory. That should have been obvious based on the comment in the .yaml file.

On the second attempt, my jobs went on hold with:

Transfer output files failure at execution point slot1_24@e2591.chtc.wisc.edu while sending files to access point ap2001. Details: 1 total failures: first failure: reading from file /var/lib/condor/execute/slot1/dir_3699332/scratch/output: (errno 2) No such file or directory

Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread htcondor/spras.sub Outdated
Comment thread htcondor/spras.sub
@jhiemstrawisc

Collaborator Author

I converted this to a draft because these docs will depend on the explicit sif transfer PR, and I haven't yet tested everything here in that paradigm.

@jhiemstrawisc

Collaborator Author

Also, apologies for the poor git etiquette in the last commit that rolled too many things into one diff (including running an rst formatter). I think some of the updates are sufficiently large that the whole file should more or less be re-assessed as a fresh document.

@jhiemstrawisc jhiemstrawisc force-pushed the update-htcondor-instructions branch from 178a93e to fe7fbbc Compare April 8, 2026 21:17
@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Apr 17, 2026
jhiemstrawisc and others added 6 commits June 8, 2026 10:25
…gging

I was tired of hacking around wanting verbose logging in the HTCondor
Snakemake executor, so I added some plumbing to pass Snakemake's
'--verbose' flag through 'snakemake_long.py' to snakemake itself.

Additionally, I added '--env-manager' so I could run things with my
preferred mamba env instead of conda (which is too slow to rebuild).
The executor has matured quite a bit since these instructions were
first drafted, and it's my hope that these changes remove a lot of
the headache for running jobs.

Now, you can edit config files in `config/` and use the `input/`
directory directly. Workflows should be submitted directly from the
repository root.
Co-authored-by: Tristan F.-R. <pub.tristanf@gmail.com>
@jhiemstrawisc jhiemstrawisc force-pushed the update-htcondor-instructions branch from fe7fbbc to ceea753 Compare June 8, 2026 15:46
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Jun 8, 2026
These came from testing Neha's real workflow in June 2026. Not totally
sure how they all work (and whether additional environment variables
will need to be added in the future), but they were key to getting
custom sif images to unpack alongside the jobs.
@jhiemstrawisc jhiemstrawisc marked this pull request as ready for review June 8, 2026 21:47
@jhiemstrawisc

Collaborator Author

This is a note for myself -- one thing I should document in the htcondor rst is the need to pre-create apptainer images before launching workflows.

@ntalluri ntalluri self-requested a review June 9, 2026 17:54
Add guidance to docs/htcondor.rst encouraging users to pre-build
per-algorithm container images rather than pulling them at runtime,
and steer them toward the proper place to build those images.

Also add a warning against running `apptainer build` directly on a
shared Access Point, pointing users to CHTC's guide for building
images in an interactive job.
@jhiemstrawisc jhiemstrawisc requested a review from agitter June 9, 2026 20:28

@agitter agitter left a comment

Collaborator

During my testing, I triggered a Snakemake lock error by launching a long job, killing it, changing the config file, and relaunching. That may be a common error.

python3.11/site-packages/snakemake/persistence.py", line 211, in lock
    raise snakemake.exceptions.LockException()
snakemake.exceptions.LockException: Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

Should we add it to troubleshooting?

I also hit this error

$ cat htcondor/logs/merge_input/merge_input-5_7645955.err
ModuleNotFoundError in file "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8:
No module named 'spras.config.revision'
  File "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8, in <module>

I'm guessing that means I need a newer version of the SPRAS sif image. However, we haven't released a SPRAS version recently. What version of the image are you testing with?

@ntalluri

ntalluri commented Jun 12, 2026

Collaborator

@agitter

For the first bug, activate the spras conda environment and then run the command snakemake --configfile <path to config file> --unlock any time the config file of choice is updated. I was planning on adding this as step five for the parallel jobs.
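
The unlock step described above can be sketched as a small helper. This is only an illustration: build_unlock_command is a hypothetical name that is not part of SPRAS, and config/egfr.yaml is simply the config file named in the profile quoted later in this thread. The command is assembled and printed rather than executed, so it is safe to try anywhere.

```python
import shlex


def build_unlock_command(configfile: str) -> list[str]:
    """Assemble the Snakemake call that clears a stale working-directory lock."""
    # --configfile and --unlock are standard Snakemake CLI flags.
    return ["snakemake", "--configfile", configfile, "--unlock"]


# Print the command for review; it is meant to be run from the repository
# root with the spras conda environment activated.
print(shlex.join(build_unlock_command("config/egfr.yaml")))
```

Printing first makes it easy to confirm the config path before actually removing a lock.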

For the second issue, you are right: the version of SPRAS in the v0.6 Docker image isn't up to date with the current version of SPRAS. Justin has a Docker image you can pull from Docker Hub (I think it is jhiemstra/spras:update-htcondor-instructions-v2). Otherwise, you will need to build an image of the updated version of SPRAS with Docker on your local machine, push it to Docker Hub, and then use that image.

@jhiemstrawisc

Collaborator Author

For the second issue, you are right, this is because the version of SPRAS in the docker image v0.6 isn't up to date with the current version of SPRAS.

In general, you should either:

  • always rebuild the SPRAS container to match the version of the repo you're working with, OR
  • check out the repo at a specific release (e.g. git checkout 0.6.0) to match the container you want to use

The key is that the repo you're using to submit from the AP should match what's in the image.

There's a callout relatively early in the documentation covering this, but I'm open to edits, since it seems like this is often missed:

   It is best practice to make sure that the Snakefile you copy for your
   workflow is the same version as the Snakefile baked into your
   workflow's container image. When this workflow runs, the Snakefile
   you just copied will be used during remote execution instead of the
   Snakefile from the container. As a result, difficult-to-diagnose
   versioning issues may occur if the version of SPRAS in the remote
   container doesn't support the Snakefile on your current branch. The
   safest bet is always to create your own image so you always know
   what's inside of it.
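
One way to catch a repo/image mismatch before submitting is a quick filename check. The sketch below is hypothetical: it assumes the image is named like spras-v0.6.0.sif (the filename mentioned earlier in this thread) and that the checked-out release tag is passed in by hand. A matching name does not prove what is inside the image, so treat it as a sanity check only.

```python
import re
from typing import Optional


def sif_version(sif_name: str) -> Optional[str]:
    """Extract 'X.Y.Z' from a name like 'spras-v0.6.0.sif' (assumed naming scheme)."""
    match = re.search(r"v(\d+\.\d+\.\d+)\.sif$", sif_name)
    return match.group(1) if match else None


def image_matches_checkout(sif_name: str, repo_tag: str) -> bool:
    """Return True only when the image name carries the same version as the repo tag."""
    version = sif_version(sif_name)
    # Accept tags written either as '0.6.0' or 'v0.6.0'.
    return version is not None and version == repo_tag.removeprefix("v")


print(image_matches_checkout("spras-v0.6.0.sif", "0.6.0"))  # True
print(image_matches_checkout("test-htc.sif", "0.6.0"))      # False: no version in the name
```

An image without a version in its name, such as a custom build, fails the check by design, which is a prompt to confirm by hand which commit it was built from.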

@ntalluri ntalluri left a comment

Collaborator

The updated documentation looks great. I added some suggestions on how to help users more.

I also remember we were changing the config.yaml file in spras_profile, and I wasn't sure if any of the commands needed to be added to the documentation.

Comment thread docs/htcondor.rst
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst
Comment thread docs/htcondor.rst Outdated

Collaborator

This is what my config.yaml looks like. I wasn't sure if we need to add:

... && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
stream_output: true
stream_error: true

parts to the documentation

# Default configuration for the SPRAS/HTCondor executor profile. Each of these values
# can also be passed via command line flags, e.g. `--jobs 30 --executor htcondor`.

# NOTE: File paths in here should be relative to where you submit from, typically the
# root of the SPRAS repository

# 'jobs' specifies the maximum number of HTCondor jobs that can be in the queue at once.
jobs: 30
executor: htcondor
configfile: config/egfr.yaml
htcondor-jobdir: htcondor/logs

# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 2 times. After that, we assume there's
# a real error that requires user/admin intervention
retries: 2

# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
# picking up where it left off, and won't re-run steps that have already completed.
default-resources:
  job_wrapper: "htcondor/spras.sh"
  # If running in CHTC, this only works with apptainer images
  # Note requirement for quotes around the image name
  container_image: "test-htc.sif"
  universe: "container"
  # The value for request_disk should be large enough to accommodate the runtime container
  # image, any additional PRM container images, and your input data.
  request_disk: "16GB"
  request_memory: "12GB"
  retry_request_memory_increase: "RequestMemory + 4"
  retry_request_memory_max: "32GB"
  classad_WantGlideIn: true
  requirements: |
    '(HAS_SINGULARITY == True) && (Poolname =!= "CHTC") && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)' 
  stream_output: true
  stream_error: true

Collaborator Author

I definitely don't think we should be covering the stream_{output/error} settings in the general purpose profile, for fear that an unknowing user will take them as the default (and make a CHTC sys admin very sad when they crash an AP). These should be intentionally hard to find because users will find them very tempting to use without understanding the detrimental effects they can have on shared computing resources.

As for the other requirements, I'd also like to avoid sticking those in the profile -- they're very specific to your run that needs both the OSPool and profiling, and these requirements are generally documented elsewhere.

Collaborator

What was the goal of these settings for Neha's OSPool runs? I don't think we need to add them here but am curious for our SPRAS benchmarking in OSPool.

Collaborator Author

  • classad_WantGlideIn: true configures a classad that enables submission to OSPool
  • (HAS_SINGULARITY == True) makes sure you land on OSPool EPs that support apptainer/singularity, which is a hard requirement for SPRAS
  • (Poolname =!= "CHTC") -- this can probably be omitted -- it disables submission to CHTC EPs. If you enable OSPool submissions and don't include this, you're submitting to both the OSPool and CHTC at once.
  • versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true) makes sure you match with EPs that have the features/configuration needed for the apptainer profiling code to work.
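
To make the composition of that requirements expression easier to see, here is a minimal sketch that joins the clauses from the list above. build_requirements and its two flags are hypothetical names introduced for illustration; only the clause strings come from the config quoted in this thread.

```python
def build_requirements(exclude_chtc: bool = True, profiling: bool = True) -> str:
    """Join the ClassAd clauses discussed above into one requirements expression."""
    # Apptainer/Singularity support is a hard requirement for SPRAS.
    clauses = ["(HAS_SINGULARITY == True)"]
    if exclude_chtc:
        # Keep jobs off CHTC EPs so they run only in the OSPool.
        clauses.append('(Poolname =!= "CHTC")')
    if profiling:
        # EP features/configuration needed by the apptainer profiling code.
        clauses.append('versionGE(split(Target.CondorVersion)[1], "24.8.0")')
        clauses.append("(isenforcingdiskusage =!= true)")
    return " && ".join(clauses)


print(build_requirements())
print(build_requirements(exclude_chtc=False, profiling=False))
```

With both flags off, only the Singularity clause remains, which matches the point that the other clauses are specific to a run that needs both the OSPool and profiling.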

Comment thread docs/htcondor.rst
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread htcondor/spras_profile/config.yaml
Comment thread docs/htcondor.rst
Comment thread docs/htcondor.rst
- ✓
- Convenience wrapper (in the repository root) around
``snakemake_long.py``.

Collaborator

The next section is what I found confusing. It gives instructions to create the .sif from the existing DockerHub image. That usually breaks. I recommend we remove it and only give instructions to build a new image from source.

Collaborator Author

Maybe you can explain how these steps break for you? I don't think I've ever run into issues building from an existing Dockerhub image, outside of the mismatched Snakefile conundrum (which I try to cover more heavily in my latest round of revisions).

Collaborator

The new option to check out a version of the SPRAS repo that matches the existing image makes sense and should help.

My prior confusion was around what a user should do if they need to build a SPRAS image themselves to match a newer commit. I don't see a way to build the Apptainer sif image entirely on CHTC. We link to CHTC docs that expect you to have a def file, which we don't have. The Apptainer instructions for building directly from a Dockerfile didn't work for me. My understanding is that I would have to build a Docker image on a local machine with Docker, push it to my DockerHub, then run a build job in CHTC to convert it to a sif file.
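
The three-step route described above (build locally, push to Docker Hub, convert in a CHTC build job) could be sketched as follows. The helper name, the example image tag, and the Dockerfile path are assumptions for illustration, not values confirmed in this thread. The commands are only assembled and printed, not executed.

```python
import shlex


def image_build_commands(tag: str, dockerfile: str, sif_name: str) -> list[list[str]]:
    """Return the local Docker steps and the Apptainer conversion step, in order."""
    return [
        # 1. On a local machine with Docker: build from the repository root.
        ["docker", "build", "-t", tag, "-f", dockerfile, "."],
        # 2. Publish the image so the build job can pull it.
        ["docker", "push", tag],
        # 3. Inside a CHTC interactive build job: convert to a sif file.
        ["apptainer", "build", sif_name, f"docker://{tag}"],
    ]


# Hypothetical tag and Dockerfile path; substitute your own.
for cmd in image_build_commands(
    "example-user/spras:my-branch", "docker-wrappers/SPRAS/Dockerfile", "spras.sif"
):
    print(shlex.join(cmd))
```

Only the last command belongs on CHTC, and per the warning in the docs it should run inside an interactive build job rather than directly on a shared Access Point.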

@agitter agitter left a comment

Collaborator

Looks ready to me. I'll let Neha check it as well.

@ntalluri ntalluri left a comment

Collaborator

This also looks good to me.

Labels

documentation Improvements or additions to documentation

4 participants